[rag evals][1/n] refactor base scoring fn & data schema check #664

yanxi0830 · 2024-12-19T22:14:28Z

What does this PR do?

This is the first PR in a series; see the follow-ups [rag evals][2/n] add more braintrust scoring fns for RAG eval #666 and [rag evals][3/n] add ability to eval retrieval + generation in agentic eval pipeline #668.
- Refactor BaseScoringFn into a minimal interface and add a new RegisteredBaseScoringFn for scoring functions that need registration.
- Refactor the data schema check.
  - To evaluate the retrieval component of RAG separately, some scoring functions additionally need a "context" column.
- Refactor the braintrust eval (more scoring functions are added and tested in the following PR).
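
As a rough illustration of the split described above, here is a minimal sketch, not the code in this PR. Only the names BaseScoringFn and RegisteredBaseScoringFn come from the discussion; the method names, the registry attribute, the row columns, and the toy EqualityScoringFn are assumptions, and everything is kept synchronous for brevity.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Row = Dict[str, Any]


class BaseScoringFn(ABC):
    """Minimal interface: score a single row and aggregate row-level results."""

    @abstractmethod
    def score_row(self, input_row: Row) -> Row:
        ...

    @abstractmethod
    def aggregate(self, scoring_results: List[Row]) -> Dict[str, Any]:
        ...

    def score(self, input_rows: List[Row]) -> List[Row]:
        # Convenience helper built only on the abstract methods above.
        return [self.score_row(row) for row in input_rows]


class RegisteredBaseScoringFn(BaseScoringFn):
    """Adds registration of scoring function definitions to the interface."""

    def __init__(self) -> None:
        # Hypothetical registry: identifier -> definition metadata.
        self.supported_fn_defs: Dict[str, Dict[str, Any]] = {}

    def register_scoring_fn_def(self, identifier: str, fn_def: Dict[str, Any]) -> None:
        if identifier in self.supported_fn_defs:
            raise ValueError(f"Scoring function def {identifier} already exists")
        self.supported_fn_defs[identifier] = fn_def


class EqualityScoringFn(RegisteredBaseScoringFn):
    """Toy scorer: 1.0 when the generated answer equals the expected answer."""

    def score_row(self, input_row: Row) -> Row:
        match = input_row["generated_answer"] == input_row["expected_answer"]
        return {"score": 1.0 if match else 0.0}

    def aggregate(self, scoring_results: List[Row]) -> Dict[str, Any]:
        if not scoring_results:
            return {"accuracy": 0.0}
        total = sum(r["score"] for r in scoring_results)
        return {"accuracy": total / len(scoring_results)}
```

A caller that does not need registration could subclass BaseScoringFn directly and skip the registry entirely.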

Test Plan

pytest -v -s -m llm_as_judge_scoring_together_inference scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct
pytest -v -s -m basic_scoring_together_inference scoring/test_scoring.py
pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py

pytest -v -s -m meta_reference_eval_together_inference eval/test_eval.py
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio eval/test_eval.py

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Ran pre-commit to handle lint / formatting issues.
Read the contributor guideline, Pull Request section?
Updated relevant documentation.
Wrote necessary unit or integration tests.

llama_stack/providers/utils/common/data_schema_validator_mixin.py

ashwinb · 2024-12-21T02:52:24Z

llama_stack/providers/utils/scoring/base_scoring_fn.py

@@ -13,12 +13,51 @@

 class BaseScoringFn(ABC):


I still don't understand why we have these base classes, because we have already declared what our impls need to do in terms of datatypes in our APIs, so the datatype for a scoring function already exists. Let's say I am implementing a new scoring function -- why do I need another base class to inherit from? If there are some utilities I need for implementing the functions, they could just be utils / free functions or, in the worst case, some mixins.

Can you explain the need for base classes please? I am very allergic to inheritance, as you and @raghotham know :)

yanxi0830 · 2024-12-26

The separate class BaseScoringFn(ABC) was split out mostly based on feedback from Tejas, whose use case does not need registration (syncing with you offline).

For our llama-stack implementations, I agree we don't need this separate BaseScoringFn, and could just use RegisteredBaseScoringFn as a mixin.

ashwinb · 2024-12-31

@yanxi0830 I see, interesting. Thanks for the explanation. We can keep this as is for now.
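
For comparison, the alternative raised in this thread (free functions plus structural typing instead of a base class) might look roughly like the sketch below. It is an illustration only: the ScoringFn protocol, aggregate_accuracy, SubsetOfScoringFn, and the column names are all hypothetical and are not part of this PR.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class ScoringFn(Protocol):
    """Structural type: anything with a matching score_row method qualifies."""

    def score_row(self, input_row: Dict[str, Any]) -> Dict[str, Any]:
        ...


def aggregate_accuracy(scoring_results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Free-function utility shared by scorers, with no inheritance involved."""
    if not scoring_results:
        return {"accuracy": 0.0}
    correct = sum(1 for r in scoring_results if r["score"] == 1.0)
    return {"accuracy": correct / len(scoring_results)}


class SubsetOfScoringFn:
    """Toy scorer with no base class: checks the expected answer is a substring."""

    def score_row(self, input_row: Dict[str, Any]) -> Dict[str, Any]:
        hit = input_row["expected_answer"] in input_row["generated_answer"]
        return {"score": 1.0 if hit else 0.0}
```

The trade-off is that shared behavior such as registration has to be passed around explicitly rather than inherited.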

raghotham · 2024-12-27T01:44:59Z

llama_stack/providers/utils/common/data_schema_validator_mixin.py

@@ -0,0 +1,93 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.


Can datasetio abstract schema validation instead of creating mixins to be used by scoring and evals?

ashwinb · 2024-12-31

@raghotham can you explain what you mean by this with an example? How do you imagine this validation happening specifically?
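
To make the question concrete, one possible shape for a schema check that lives in a single place is sketched below. This is an assumption for illustration, not the validator added in this PR: the column sets and the validate_rows helper are hypothetical, and only the "context" column is taken from the PR description.

```python
from typing import Any, Dict, FrozenSet, List

# Hypothetical column sets; only "context" is named in the PR description.
GENERATION_COLUMNS: FrozenSet[str] = frozenset(
    {"input_query", "expected_answer", "generated_answer"}
)
RETRIEVAL_COLUMNS: FrozenSet[str] = GENERATION_COLUMNS | {"context"}


def validate_rows(rows: List[Dict[str, Any]], required: FrozenSet[str]) -> None:
    """Raise ValueError if any row lacks one of the required columns."""
    for index, row in enumerate(rows):
        missing = required.difference(row)
        if missing:
            raise ValueError(f"row {index} is missing columns: {sorted(missing)}")
```

A scoring function that evaluates retrieval would call validate_rows(rows, RETRIEVAL_COLUMNS), while a generation-only scorer would pass GENERATION_COLUMNS.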

ashwinb

lg

# What does this PR do?

- add more braintrust scoring functions for RAG eval
- add tests for evaluating against context

## Test Plan

```
pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py
```

<img width="850" alt="image" src="https://github.com/user-attachments/assets/2f8f0693-ea13-422c-a183-f798faf86433" />

**Example Output**

- https://gist.github.com/yanxi0830/2acf3b8b3e8132fda2a48b1f0a49711b

<img width="827" alt="image" src="https://github.com/user-attachments/assets/9014b957-107c-4c23-bbc0-812cbd0b16da" />

<img width="436" alt="image" src="https://github.com/user-attachments/assets/21e9da17-f426-49b2-9113-855cab7b3d40" />

## Sources

Please link relevant resources if necessary.

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.

yanxi0830 added 3 commits December 19, 2024 11:49

refactor base scoring fn v.s. registerable scoring fn

0096c1a

refactor base scoring fn v.s. registerable scoring fn

199f92d

Merge branch 'main' into rag_scoring_fn_1

13720cb

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 19, 2024

yanxi0830 added 2 commits December 19, 2024 14:26

scoring

1094f26

refactor schema check

55e4f4e

yanxi0830 changed the title ~~[rag eval][1/n] refactor base scoring fn & add more braintrust evaluators~~ [rag eval][1/n] refactor base scoring fn & data schema check Dec 19, 2024

yanxi0830 marked this pull request as ready for review December 20, 2024 00:10

yanxi0830 requested review from ashwinb, hardikjshah, dltn, raghotham, dineshyv and vladimirivic as code owners December 20, 2024 00:10

yanxi0830 added 3 commits December 19, 2024 16:20

refactor schema check

c15b0d5

clean up

c4af8f8

clean up

b94ab8d

yanxi0830 commented Dec 20, 2024

llama_stack/providers/utils/common/data_schema_validator_mixin.py

yanxi0830 changed the title ~~[rag eval][1/n] refactor base scoring fn & data schema check~~ [rag evals][1/n] refactor base scoring fn & data schema check Dec 20, 2024

ashwinb reviewed Dec 21, 2024

raghotham reviewed Dec 27, 2024

yanxi0830 added 3 commits December 30, 2024 17:20

Merge branch 'main' into rag_scoring_fn_1

d62f104

precommit

3367c52

refactor schema check

eb92322

ashwinb approved these changes Jan 2, 2025

yanxi0830 merged commit 3a269c4 into main Jan 2, 2025
2 checks passed

yanxi0830 deleted the rag_scoring_fn_1 branch January 2, 2025 19:21

ashwinb Dec 21, 2024

yanxi0830 Dec 26, 2024

ashwinb Dec 31, 2024

raghotham Dec 27, 2024

ashwinb Dec 31, 2024

ashwinb left a comment
